Overview

Dataset statistics

Number of variables: 9
Number of observations: 20640
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 1.4 MiB
Average record size in memory: 72.0 B
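
The two memory figures are consistent with nine 8-byte numeric columns. The sketch below reproduces them with plain arithmetic. The 8-byte width (float64 storage) is an assumption, because the report does not state the column dtypes.

```python
# Row and column counts are taken from the dataset statistics.
N_ROWS = 20640
N_COLS = 9
BYTES_PER_VALUE = 8  # assumption: every column is stored as float64

record_bytes = N_COLS * BYTES_PER_VALUE   # bytes per observation
total_bytes = N_ROWS * record_bytes       # bytes for the whole table
total_mib = total_bytes / 1024 ** 2       # 1 MiB = 1024 * 1024 bytes

print(f"Average record size: {record_bytes} B")
print(f"Total size: {total_mib:.1f} MiB")
```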

Variable types

Numeric: 9

Warnings

Latitude is highly correlated with Longitude [High correlation]
Longitude is highly correlated with Latitude [High correlation]
AveRooms is highly skewed (γ1 = 20.69786896) [Skewed]
AveBedrms is highly skewed (γ1 = 31.31695625) [Skewed]
AveOccup is highly skewed (γ1 = 97.63956096) [Skewed]
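
The skewness warnings report a coefficient γ1 for each flagged variable. The report does not say which estimator produced it, so the sketch below assumes the bias-adjusted Fisher-Pearson sample skewness, a common default in data-analysis libraries. The function name and the toy data are illustrative only.

```python
import math

def sample_skewness(values):
    """Bias-adjusted Fisher-Pearson skewness (assumed estimator, see lead-in)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three observations")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data has no asymmetry
    g1 = m3 / m2 ** 1.5                            # biased moment coefficient
    return math.sqrt(n * (n - 1)) / (n - 2) * g1   # small-sample adjustment

# A single large value produces a long right tail and a positive coefficient.
print(sample_skewness([1, 1, 1, 1, 10]))
# Symmetric data gives zero.
print(sample_skewness([1, 2, 3, 4, 5]))
```

A large positive value, as reported for AveRooms, AveBedrms and AveOccup, indicates a long right tail driven by a few extreme observations.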

Reproduction

Analysis started: 2021-04-20 08:41:34.546695
Analysis finished: 2021-04-20 08:42:05.594776
Duration: 31.05 seconds
Software version: pandas-profiling v2.11.0
Download configuration: config.yaml

Variables

MedInc
Real number (ℝ≥0)

Distinct: 12928
Distinct (%): 62.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.870671003
Minimum: 0.4999
Maximum: 15.0001
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 0.4999
5-th percentile: 1.60057
Q1: 2.5634
Median: 3.5348
Q3: 4.74325
95-th percentile: 7.300305
Maximum: 15.0001
Range: 14.5002
Interquartile range (IQR): 2.17985

Descriptive statistics

Standard deviation: 1.899821718
Coefficient of variation (CV): 0.4908249026
Kurtosis: 4.952524102
Mean: 3.870671003
Median Absolute Deviation (MAD): 1.0642
Skewness: 1.646656702
Sum: 79890.6495
Variance: 3.60932256
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value                 Count   Frequency (%)
15.0001               49      0.2%
3.125                 49      0.2%
2.875                 46      0.2%
2.625                 44      0.2%
4.125                 44      0.2%
3.875                 41      0.2%
3.375                 38      0.2%
3                     38      0.2%
4                     37      0.2%
3.625                 37      0.2%
Other values (12918)  20217   98.0%

Smallest values

Value                 Count   Frequency (%)
0.4999                12      0.1%
0.536                 10      < 0.1%
0.5495                1       < 0.1%
0.6433                1       < 0.1%
0.6775                1       < 0.1%
0.6825                1       < 0.1%
0.6831                1       < 0.1%
0.696                 1       < 0.1%
0.6991                1       < 0.1%
0.7007                1       < 0.1%

Largest values

Value                 Count   Frequency (%)
15.0001               49      0.2%
15                    2       < 0.1%
14.9009               1       < 0.1%
14.5833               1       < 0.1%
14.4219               1       < 0.1%
14.4113               1       < 0.1%
14.2959               1       < 0.1%
14.2867               1       < 0.1%
13.947                1       < 0.1%
13.8556               1       < 0.1%
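
Several MedInc figures are derived from others in the same summary. The sketch below recomputes them from the reported values as a consistency check. All inputs are copied from the MedInc statistics above, and the variable names are illustrative.

```python
# Inputs copied from the MedInc summary.
minimum, maximum = 0.4999, 15.0001
q1, q3 = 2.5634, 4.74325
mean, std = 3.870671003, 1.899821718
distinct, n_rows = 12928, 20640

value_range = maximum - minimum          # Range
iqr = q3 - q1                            # Interquartile range (IQR)
cv = std / mean                          # Coefficient of variation (CV)
variance = std ** 2                      # Variance
distinct_pct = 100 * distinct / n_rows   # Distinct (%)

print(f"Range:        {value_range:.4f}")
print(f"IQR:          {iqr:.5f}")
print(f"CV:           {cv:.6f}")
print(f"Variance:     {variance:.6f}")
print(f"Distinct (%): {distinct_pct:.1f}%")
```

The same relations hold for every variable in the report, so they are a quick way to spot copy errors when the numbers are transcribed.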

HouseAge
Real number (ℝ≥0)

Distinct: 52
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 28.63948643
Minimum: 1
Maximum: 52
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 8
Q1: 18
Median: 29
Q3: 37
95-th percentile: 52
Maximum: 52
Range: 51
Interquartile range (IQR): 19

Descriptive statistics

Standard deviation: 12.58555761
Coefficient of variation (CV): 0.4394477408
Kurtosis: -0.8006288536
Mean: 28.63948643
Median Absolute Deviation (MAD): 10
Skewness: 0.0603306376
Sum: 591119
Variance: 158.3962604
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value                 Count   Frequency (%)
52                    1273    6.2%
36                    862     4.2%
35                    824     4.0%
16                    771     3.7%
17                    698     3.4%
34                    689     3.3%
26                    619     3.0%
33                    615     3.0%
18                    570     2.8%
25                    566     2.7%
Other values (42)     13153   63.7%

Smallest values

Value                 Count   Frequency (%)
1                     4       < 0.1%
2                     58      0.3%
3                     62      0.3%
4                     191     0.9%
5                     244     1.2%
6                     160     0.8%
7                     175     0.8%
8                     206     1.0%
9                     205     1.0%
10                    264     1.3%

Largest values

Value                 Count   Frequency (%)
52                    1273    6.2%
51                    48      0.2%
50                    136     0.7%
49                    134     0.6%
48                    177     0.9%
47                    198     1.0%
46                    245     1.2%
45                    294     1.4%
44                    356     1.7%
43                    353     1.7%
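
The frequency tables pair each value with its count and its share of all observations, and collapse the remainder into an "Other values" row. A minimal sketch of that layout follows. The helper names are hypothetical, and the "< 0.1%" rule (used when a share rounds to 0.0%) is inferred from the tables rather than taken from the pandas-profiling source.

```python
from collections import Counter

def share_label(count, total):
    """Format a share to one decimal, using '< 0.1%' when it rounds to 0.0%."""
    label = f"{100 * count / total:.1f}%"
    return "< 0.1%" if label == "0.0%" else label

def frequency_table(values, top=10):
    """Return (value, count, share) rows for the `top` most frequent values."""
    total = len(values)
    counts = Counter(values)
    rows = [(v, c, share_label(c, total)) for v, c in counts.most_common(top)]
    remaining = total - sum(c for _, c, _ in rows)
    if remaining:
        # Collapse everything outside the top rows into a single line.
        other_distinct = len(counts) - len(rows)
        rows.append((f"Other values ({other_distinct})", remaining,
                     share_label(remaining, total)))
    return rows

# Counts from the HouseAge tables reproduce the printed shares.
print(share_label(1273, 20640))  # share of HouseAge = 52
print(share_label(4, 20640))     # share of HouseAge = 1

for row in frequency_table([52, 52, 52, 36, 36, 35, 16], top=2):
    print(row)
```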

AveRooms
Real number (ℝ≥0)

SKEWED

Distinct: 19392
Distinct (%): 94.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 5.428999742
Minimum: 0.8461538462
Maximum: 141.9090909
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 0.8461538462
5-th percentile: 3.432330006
Q1: 4.440716236
Median: 5.229128788
Q3: 6.052380952
95-th percentile: 7.640246547
Maximum: 141.9090909
Range: 141.0629371
Interquartile range (IQR): 1.611664716

Descriptive statistics

Standard deviation: 2.474173139
Coefficient of variation (CV): 0.4557327789
Kurtosis: 879.353264
Mean: 5.428999742
Median Absolute Deviation (MAD): 0.8029206051
Skewness: 20.69786896
Sum: 112054.5547
Variance: 6.121532724
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value                 Count   Frequency (%)
5                     27      0.1%
4.5                   22      0.1%
4                     21      0.1%
6                     20      0.1%
5.333333333           13      0.1%
5.5                   11      0.1%
4.666666667           9       < 0.1%
3                     8       < 0.1%
5.666666667           8       < 0.1%
7                     7       < 0.1%
Other values (19382)  20494   99.3%

Smallest values

Value                 Count   Frequency (%)
0.8461538462          1       < 0.1%
0.8888888889          1       < 0.1%
1                     1       < 0.1%
1.130434783           2       < 0.1%
1.260869565           1       < 0.1%
1.378486056           1       < 0.1%
1.411290323           1       < 0.1%
1.465753425           1       < 0.1%
1.550408719           1       < 0.1%
1.553030303           1       < 0.1%

Largest values

Value                 Count   Frequency (%)
141.9090909           1       < 0.1%
132.5333333           1       < 0.1%
62.42222222           1       < 0.1%
61.8125               1       < 0.1%
59.875                1       < 0.1%
56.26923077           1       < 0.1%
52.84821429           1       < 0.1%
52.69047619           1       < 0.1%
50.83783784           1       < 0.1%
47.51515152           1       < 0.1%

AveBedrms
Real number (ℝ≥0)

SKEWED

Distinct: 14233
Distinct (%): 69.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.09667515
Minimum: 0.3333333333
Maximum: 34.06666667
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 0.3333333333
5-th percentile: 0.9391087789
Q1: 1.006079046
Median: 1.048780488
Q3: 1.099526066
95-th percentile: 1.273005718
Maximum: 34.06666667
Range: 33.73333333
Interquartile range (IQR): 0.09344702031

Descriptive statistics

Standard deviation: 0.4739108568
Coefficient of variation (CV): 0.4321342167
Kurtosis: 1636.711972
Mean: 1.09667515
Median Absolute Deviation (MAD): 0.04608687412
Skewness: 31.31695625
Sum: 22635.37509
Variance: 0.2245915002
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value                 Count   Frequency (%)
1                     288     1.4%
1.125                 29      0.1%
1.058823529           26      0.1%
1.083333333           25      0.1%
1.1                   25      0.1%
1.052631579           23      0.1%
1.090909091           21      0.1%
1.05                  21      0.1%
1.055555556           20      0.1%
1.111111111           18      0.1%
Other values (14223)  20144   97.6%

Smallest values

Value                 Count   Frequency (%)
0.3333333333          1       < 0.1%
0.375                 1       < 0.1%
0.4444444444          1       < 0.1%
0.5                   3       < 0.1%
0.5263157895          1       < 0.1%
0.53125               1       < 0.1%
0.5454545455          1       < 0.1%
0.5555555556          1       < 0.1%
0.5625                2       < 0.1%
0.5714285714          2       < 0.1%

Largest values

Value                 Count   Frequency (%)
34.06666667           1       < 0.1%
25.63636364           1       < 0.1%
15.3125               1       < 0.1%
14.11111111           1       < 0.1%
11.41071429           1       < 0.1%
11.18181818           1       < 0.1%
11                    1       < 0.1%
10.27027027           1       < 0.1%
10.15384615           1       < 0.1%
9.703703704           1       < 0.1%

Population
Real number (ℝ≥0)

Distinct: 3888
Distinct (%): 18.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1425.476744
Minimum: 3
Maximum: 35682
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 3
5-th percentile: 348
Q1: 787
Median: 1166
Q3: 1725
95-th percentile: 3288
Maximum: 35682
Range: 35679
Interquartile range (IQR): 938

Descriptive statistics

Standard deviation: 1132.462122
Coefficient of variation (CV): 0.7944444737
Kurtosis: 73.55311639
Mean: 1425.476744
Median Absolute Deviation (MAD): 440
Skewness: 4.935858227
Sum: 29421840
Variance: 1282470.457
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value                 Count   Frequency (%)
891                   25      0.1%
761                   24      0.1%
1227                  24      0.1%
850                   24      0.1%
1052                  24      0.1%
825                   23      0.1%
999                   22      0.1%
782                   22      0.1%
1005                  22      0.1%
781                   21      0.1%
Other values (3878)   20409   98.9%

Smallest values

Value                 Count   Frequency (%)
3                     1       < 0.1%
5                     1       < 0.1%
6                     1       < 0.1%
8                     4       < 0.1%
9                     2       < 0.1%
11                    1       < 0.1%
13                    4       < 0.1%
14                    3       < 0.1%
15                    2       < 0.1%
17                    2       < 0.1%

Largest values

Value                 Count   Frequency (%)
35682                 1       < 0.1%
28566                 1       < 0.1%
16305                 1       < 0.1%
16122                 1       < 0.1%
15507                 1       < 0.1%
15037                 1       < 0.1%
13251                 1       < 0.1%
12873                 1       < 0.1%
12427                 1       < 0.1%
12203                 1       < 0.1%

AveOccup
Real number (ℝ≥0)

SKEWED

Distinct: 18841
Distinct (%): 91.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.070655159
Minimum: 0.6923076923
Maximum: 1243.333333
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 0.6923076923
5-th percentile: 1.872544757
Q1: 2.429741148
Median: 2.818115654
Q3: 3.282260924
95-th percentile: 4.333416667
Maximum: 1243.333333
Range: 1242.641026
Interquartile range (IQR): 0.8525197767

Descriptive statistics

Standard deviation: 10.38604956
Coefficient of variation (CV): 3.382356215
Kurtosis: 10651.01064
Mean: 3.070655159
Median Absolute Deviation (MAD): 0.4195255877
Skewness: 97.63956096
Sum: 63378.32249
Variance: 107.8700255
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value                 Count   Frequency (%)
3                     35      0.2%
2                     18      0.1%
2.5                   17      0.1%
2.666666667           16      0.1%
2.333333333           13      0.1%
2.6                   12      0.1%
3.2                   11      0.1%
2.555555556           10      < 0.1%
2.75                  9       < 0.1%
3.25                  9       < 0.1%
Other values (18831)  20490   99.3%

Smallest values

Value                 Count   Frequency (%)
0.6923076923          1       < 0.1%
0.75                  1       < 0.1%
0.9705882353          1       < 0.1%
1.060606061           1       < 0.1%
1.066176471           1       < 0.1%
1.089267803           1       < 0.1%
1.089285714           1       < 0.1%
1.161290323           1       < 0.1%
1.169329073           1       < 0.1%
1.215873016           1       < 0.1%

Largest values

Value                 Count   Frequency (%)
1243.333333           1       < 0.1%
599.7142857           1       < 0.1%
502.4615385           1       < 0.1%
230.1724138           1       < 0.1%
83.17142857           1       < 0.1%
63.75                 1       < 0.1%
51.4                  1       < 0.1%
41.21428571           1       < 0.1%
33.95294118           1       < 0.1%
21.33333333           1       < 0.1%

Latitude
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 862
Distinct (%): 4.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 35.63186143
Minimum: 32.54
Maximum: 41.95
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 32.54
5-th percentile: 32.82
Q1: 33.93
Median: 34.26
Q3: 37.71
95-th percentile: 38.96
Maximum: 41.95
Range: 9.41
Interquartile range (IQR): 3.78

Descriptive statistics

Standard deviation: 2.135952397
Coefficient of variation (CV): 0.05994501302
Kurtosis: -1.117759781
Mean: 35.63186143
Median Absolute Deviation (MAD): 1.23
Skewness: 0.4659530037
Sum: 735441.62
Variance: 4.562292644
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value                 Count   Frequency (%)
34.06                 244     1.2%
34.05                 236     1.1%
34.08                 234     1.1%
34.07                 231     1.1%
34.04                 221     1.1%
34.09                 212     1.0%
34.02                 208     1.0%
34.1                  203     1.0%
34.03                 193     0.9%
33.93                 181     0.9%
Other values (852)    18477   89.5%

Smallest values

Value                 Count   Frequency (%)
32.54                 1       < 0.1%
32.55                 3       < 0.1%
32.56                 10      < 0.1%
32.57                 18      0.1%
32.58                 26      0.1%
32.59                 11      0.1%
32.6                  9       < 0.1%
32.61                 14      0.1%
32.62                 13      0.1%
32.63                 18      0.1%

Largest values

Value                 Count   Frequency (%)
41.95                 2       < 0.1%
41.92                 1       < 0.1%
41.88                 1       < 0.1%
41.86                 3       < 0.1%
41.84                 1       < 0.1%
41.82                 1       < 0.1%
41.81                 2       < 0.1%
41.8                  3       < 0.1%
41.79                 1       < 0.1%
41.78                 3       < 0.1%

Longitude
Real number (ℝ)

HIGH CORRELATION

Distinct: 844
Distinct (%): 4.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: -119.5697045
Minimum: -124.35
Maximum: -114.31
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: -124.35
5-th percentile: -122.47
Q1: -121.8
Median: -118.49
Q3: -118.01
95-th percentile: -117.08
Maximum: -114.31
Range: 10.04
Interquartile range (IQR): 3.79

Descriptive statistics

Standard deviation: 2.003531724
Coefficient of variation (CV): -0.01675618195
Kurtosis: -1.330152366
Mean: -119.5697045
Median Absolute Deviation (MAD): 1.28
Skewness: -0.297801208
Sum: -2467918.7
Variance: 4.014139367
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value                 Count   Frequency (%)
-118.31               162     0.8%
-118.3                160     0.8%
-118.29               148     0.7%
-118.27               144     0.7%
-118.32               142     0.7%
-118.28               141     0.7%
-118.35               140     0.7%
-118.36               138     0.7%
-118.19               135     0.7%
-118.25               128     0.6%
Other values (834)    19202   93.0%

Smallest values

Value                 Count   Frequency (%)
-124.35               1       < 0.1%
-124.3                2       < 0.1%
-124.27               1       < 0.1%
-124.26               1       < 0.1%
-124.25               1       < 0.1%
-124.23               3       < 0.1%
-124.22               1       < 0.1%
-124.21               3       < 0.1%
-124.19               4       < 0.1%
-124.18               6       < 0.1%

Largest values

Value                 Count   Frequency (%)
-114.31               1       < 0.1%
-114.47               1       < 0.1%
-114.49               1       < 0.1%
-114.55               1       < 0.1%
-114.56               1       < 0.1%
-114.57               3       < 0.1%
-114.58               2       < 0.1%
-114.59               2       < 0.1%
-114.6                3       < 0.1%
-114.61               3       < 0.1%

MedHouseVal
Real number (ℝ≥0)

Distinct: 3842
Distinct (%): 18.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.068558169
Minimum: 0.14999
Maximum: 5.00001
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 0.14999
5-th percentile: 0.662
Q1: 1.196
Median: 1.797
Q3: 2.64725
95-th percentile: 4.8981
Maximum: 5.00001
Range: 4.85002
Interquartile range (IQR): 1.45125

Descriptive statistics

Standard deviation: 1.153956159
Coefficient of variation (CV): 0.55785531
Kurtosis: 0.3278702429
Mean: 2.068558169
Median Absolute Deviation (MAD): 0.684
Skewness: 0.9777632739
Sum: 42695.04061
Variance: 1.331614816
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value                 Count   Frequency (%)
5.00001               965     4.7%
1.375                 122     0.6%
1.625                 117     0.6%
1.125                 103     0.5%
1.875                 93      0.5%
2.25                  92      0.4%
3.5                   79      0.4%
0.875                 78      0.4%
2.75                  65      0.3%
1.5                   64      0.3%
Other values (3832)   18862   91.4%

Smallest values

Value                 Count   Frequency (%)
0.14999               4       < 0.1%
0.175                 1       < 0.1%
0.225                 4       < 0.1%
0.25                  1       < 0.1%
0.266                 1       < 0.1%
0.269                 1       < 0.1%
0.275                 1       < 0.1%
0.283                 1       < 0.1%
0.3                   2       < 0.1%
0.325                 4       < 0.1%

Largest values

Value                 Count   Frequency (%)
5.00001               965     4.7%
5                     27      0.1%
4.991                 1       < 0.1%
4.99                  1       < 0.1%
4.988                 1       < 0.1%
4.987                 1       < 0.1%
4.986                 1       < 0.1%
4.984                 1       < 0.1%
4.976                 1       < 0.1%
4.974                 1       < 0.1%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear relationship the steepness of the line (its angle to the x-axis) does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
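
The definition above can be sketched in a few lines of plain Python. This is an illustration of the formula, not the code pandas-profiling runs; the function name and sample data are made up for the example.

```python
import math

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and the standard deviations cancel,
    # so plain sums of centered products are enough.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(pearson_r(x, y))
# Shifting and rescaling x leaves r unchanged (location and scale invariance).
print(pearson_r([10 * a + 3 for a in x], y))
```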

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables and is therefore better than Pearson's r at catching nonlinear monotonic correlations. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of the standard deviations of those rank variables.
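
As a sketch of that recipe, the code below ranks each variable and then applies the covariance-over-standard-deviations formula to the ranks. Giving tied values the average of their positions is an assumption here, since the text does not say how ties are handled. Function names are illustrative.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the rank variables of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)

# A nonlinear but strictly increasing relationship still has perfect rank agreement.
print(spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))
print(average_ranks([10, 20, 20, 30]))
```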

Kendall's τ

Like Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
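
The pair-counting definition translates directly into code. The sketch below implements the plain form described in the text (often called tau-a), in which tied pairs count as neither concordant nor discordant. Which variant the report itself used is not stated, and libraries often apply a tie-corrected version, so results can differ when ties are present.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs, without tie correction."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    concordant = discordant = total = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        total += 1
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables move in the same direction
        elif direction < 0:
            discordant += 1   # the variables move in opposite directions
    return (concordant - discordant) / total

# Three pairs: two concordant, one discordant.
print(kendall_tau_a([1, 2, 3], [1, 3, 2]))
```

This version compares every pair and therefore runs in quadratic time, which is acceptable for illustration but slow for 20640 observations.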

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Missing values

A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly spot patterns in data completion.
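
To make the two displays concrete, here is a minimal text-only sketch that counts missing values per column and draws one row of marks per record. The sample rows are hypothetical: the profiled dataset has no missing cells, so gaps were inserted to make the output interesting. The helper names are illustrative.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def nullity_by_column(rows, columns):
    """Number of missing cells in each column."""
    return {col: sum(is_missing(row.get(col)) for row in rows) for col in columns}

def nullity_matrix(rows, columns):
    """One string per record: '#' for a present cell, '.' for a missing one."""
    return ["".join("." if is_missing(row.get(col)) else "#" for col in columns)
            for row in rows]

# Hypothetical records with gaps added for illustration.
columns = ["MedInc", "HouseAge"]
rows = [
    {"MedInc": 8.3252, "HouseAge": 41.0},
    {"MedInc": None, "HouseAge": 21.0},
    {"MedInc": 7.2574, "HouseAge": float("nan")},
]

print(nullity_by_column(rows, columns))
for line in nullity_matrix(rows, columns):
    print(line)
```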

Sample

First rows

Index  MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0      8.3252  41.0      6.984127  1.023810   322.0       2.555556  37.88     -122.23    4.526
1      8.3014  21.0      6.238137  0.971880   2401.0      2.109842  37.86     -122.22    3.585
2      7.2574  52.0      8.288136  1.073446   496.0       2.802260  37.85     -122.24    3.521
3      5.6431  52.0      5.817352  1.073059   558.0       2.547945  37.85     -122.25    3.413
4      3.8462  52.0      6.281853  1.081081   565.0       2.181467  37.85     -122.25    3.422
5      4.0368  52.0      4.761658  1.103627   413.0       2.139896  37.85     -122.25    2.697
6      3.6591  52.0      4.931907  0.951362   1094.0      2.128405  37.84     -122.25    2.992
7      3.1200  52.0      4.797527  1.061824   1157.0      1.788253  37.84     -122.25    2.414
8      2.0804  42.0      4.294118  1.117647   1206.0      2.026891  37.84     -122.26    2.267
9      3.6912  52.0      4.970588  0.990196   1551.0      2.172269  37.84     -122.25    2.611

Last rows

Index  MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
20630  3.5673  11.0      5.932584  1.134831   1257.0      2.824719  39.29     -121.32    1.120
20631  3.5179  15.0      6.145833  1.141204   1200.0      2.777778  39.33     -121.40    1.072
20632  3.1250  15.0      6.023377  1.080519   1047.0      2.719481  39.26     -121.45    1.156
20633  2.5495  27.0      5.445026  1.078534   1082.0      2.832461  39.19     -121.53    0.983
20634  3.7125  28.0      6.779070  1.148256   1041.0      3.026163  39.27     -121.56    1.168
20635  1.5603  25.0      5.045455  1.133333   845.0       2.560606  39.48     -121.09    0.781
20636  2.5568  18.0      6.114035  1.315789   356.0       3.122807  39.49     -121.21    0.771
20637  1.7000  17.0      5.205543  1.120092   1007.0      2.325635  39.43     -121.22    0.923
20638  1.8672  18.0      5.329513  1.171920   741.0       2.123209  39.43     -121.32    0.847
20639  2.3886  16.0      5.254717  1.162264   1387.0      2.616981  39.37     -121.24    0.894